Syntactic Analysis in the Spoken Dutch Corpus (CGN)
نویسندگان
چکیده
The paper describes the syntactic annotation of the Spoken Dutch Corpus (“Corpus Gesproken Nederlands” or CGN), the Dutch-Flemish project (1998-2003) aiming at the collection, description and annotation of ten million words of spoken Dutch. In the first part, the background of the parsing strategy is discussed, as well as some details concerning the actual implementation of the parsing process. The second part discusses some examples of practical applications of the result of the parsing process.
منابع مشابه
Syntactic Annotation for the Spoken Dutch Corpus Project (CGN)
Of the ten million words of contemporary standard Dutch in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN), a selection of one million words of natural spoken language will be annotated syntactically. In the present paper we discuss the tag sets and the annotation procedures that are currently being developed and tested. The annotation tags provide information about syntactic constit...
متن کاملHarvesting Dutch Trees: Syntactic Properties of Spoken Dutch
In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotations, we illustrate how the Corpus may be utilized...
متن کاملJASMIN-CGN: Extension of the Spoken Dutch Corpus with Speech of Elderly People, Children and Non-natives in the Human-Machine Interaction Modality
Large speech corpora (LSC) constitute an indispensable resource for conducting research in speech processing and for developing real-life speech applications. In 2004 the Spoken Dutch Corpus (CGN) became available, a corpus of standard Dutch as spoken by adult natives in the Netherlands and Flanders. Owing to budget constraints, CGN does not include speech of children, non-natives, elderly peop...
متن کاملCGN, an annotated corpus of spoken Dutch
Although there are two variants of Dutch, the northern variant being the one used in the Netherlands and the southern variant in Flanders (Belgium), one corpus of spoken Dutch is under construction, the Spoken Dutch Corpus (CGN). In this paper first the principles of this corpus will be discussed, thereafter a few small case studies will show what the merits of such a corpus are.
متن کاملAutomatic Phonemic Labeling and Segmentation of Spoken Dutch
The CGN corpus (Oostdijk, 2000) (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of th...
متن کامل